A Scalable System for Identifying Co-derivative Documents
Authors
Abstract
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate the information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used to identify co-derivative clusters efficiently and reliably, and describe deco, a prototype system that uses spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
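To make the idea of fingerprinting concrete, the following is a minimal sketch of chunk-hash matching in general. It is not spex or deco, whose details are not given in this abstract. The chunk definition (overlapping three-word sequences), the use of truncated SHA-256, and all function names are assumptions chosen for illustration only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield overlapping word chunks of `size` words (an assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Map a chunk to a short, stable hash value (truncated SHA-256, an assumption)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunk_counts(docs, size=3):
    """Count, for every document pair, the distinct chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            postings[chunk_hash(chunk)].add(doc_id)

    counts = defaultdict(int)
    for doc_ids in postings.values():
        # A chunk seen in only one document is no evidence of co-derivation.
        if len(doc_ids) < 2:
            continue
        for a, b in combinations(sorted(doc_ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy collection: "a" and "b" share a passage, "c" is unrelated.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an entirely unrelated sentence about document hashing",
}
print(shared_chunk_counts(docs))
```

In this toy collection only the pair ("a", "b") shares any chunks. Most of the hashes in the index belong to a single document and contribute nothing to the comparison, which suggests why isolating the duplicated chunks matters for this kind of approach.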
Similar resources
Accurate discovery of co-derivative documents via duplicate text detection
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences,...
Methods for Identifying Versioned and Plagiarised Documents
The widespread use of online publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarising the work of others. We evaluate two families of methods for searching a collection to find documents that are co-derivative, that is, are versions or plagiarisms of each other. The first, the ranking ...
Identifying Similar and Co-referring Documents Across Languages
This paper presents a methodology for finding similarity and co-reference of documents across languages. The similarity between the documents is identified according to the content of the whole document and co-referencing of documents is found by taking the named entities present in the document. Here we use Vector Space Model (VSM) for identifying both similarity and co-reference. This can be ...
A Bibliometric Analysis of Open Strategy: A new Concept in Strategic Management
Strategy development has traditionally been an exclusive and secretive matter. However, some organizations have recently used IT to enable openness for making a strategy. The aim of this paper was to research the trends of open strategy by applying bibliometric mapping. The method involves identifying open strategy-related documents, including a sample of 1717 existing documents from 2000 to 20...
Drawing Co-Citation Networks of Corona Virus Studies
Background and Aim: The purpose of the present study is to map the coronavirus domain citation network to better understand this domain based on all other citation networks. Materials and Methods: The present study is applied in terms of purpose, and is descriptive scientometrics in terms of type, which has been done with the all-citation method. In this study, all scientific publications on ...